Biomedical research produces a very large number of papers every year. To 
process this volume of textual data, information retrieval techniques are 
combined with machine learning methods such as natural language processing 
(NLP) to extract useful insights from the data. In this work, we address 
the challenges of existing NLP techniques such as Word2Vec, Doc2Vec, and 
BioSentVec, and then classify articles using Concept2Vec with random forest 
classification. Concept vector techniques are emerging methods in NLP and 
in information retrieval systems. Pretrained concept encoders are available 
for general machine learning, but none exists for the biomedical domain. 
This paper proposes a Concept2Vec model trained on biomedical text. Our 
work emphasizes the quality of the semantics defined between medical terms, 
which can help medical practitioners make decisions with fewer errors. We 
use 20 thousand biomedical documents to train the Concept2Vec embeddings, 
and the proposed approach shows the best concept similarity on the relevant 
measures compared with existing approaches such as Word2Vec, global vectors 
(GloVe), fastText, BioSentVec, bidirectional encoder representations from 
transformers (BERT), and biomedical BERT. 
1. INTRODUCTION 


Recent advances in machine learning aimed at automating human tasks require machines to understand 
natural language. In this context, language N-grams have always played an important role, and language 
modelling has become essential to natural language processing by machines [1]. With language models, a 
machine can predict the next sequence of text in sentence formation, the probability of word sequences, and 
topic sequences in recommendation systems [2]. Language models are used in many machine learning 
applications, such as sentence tagging, language translation, voice recognition, question answering systems, 
keyword search mechanisms, and text processing (voice-to-text and text-to-voice) [3]. Language models vary 
with the application and are generally of two types. 

Statistical machine language models: a statistical word network defines the probabilistic relations 
between the words and the next word in the sequence [4]. It uses the chain rule to generate the sequence. 
The probability of a sequence of terms is defined as shown in (1). 


P(w_1 w_2 w_3 w_4) = p(w_1) p(w_2 \mid w_1) p(w_3 \mid w_1 w_2) p(w_4 \mid w_1 w_2 w_3)    (1) 

Examples of statistical language models include unigram, bigram, trigram, and in general N-gram models, 
as well as bidirectional and exponential models. The unigram model of the sentence is defined as shown in (2): 


P_{uni}(w_1 w_2 w_3 w_4) = p(w_1) p(w_2) p(w_3) p(w_4)    (2) 

The bigram model of the sentence, in which each word depends only on the word before it, is defined as 
shown in (3). 

P_{bi}(w_1 w_2 w_3 w_4) = p(w_1) p(w_2 \mid w_1) p(w_3 \mid w_2) p(w_4 \mid w_3)    (3) 
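
The unigram and bigram factorizations in (2) and (3) can be illustrated with a minimal sketch. The toy 
corpus, the function names, and the use of unsmoothed maximum-likelihood counts are assumptions made for 
illustration; they are not taken from the paper. 

```python
from collections import Counter

def train_counts(sentences):
    """Count unigrams and adjacent word pairs over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def unigram_prob(tokens, unigrams):
    """Equation (2): product of independent word probabilities."""
    total = sum(unigrams.values())
    prob = 1.0
    for word in tokens:
        prob *= unigrams[word] / total
    return prob

def bigram_prob(tokens, unigrams, bigrams):
    """Equation (3): p(w1) times the product of p(w_i | w_{i-1})."""
    total = sum(unigrams.values())
    prob = unigrams[tokens[0]] / total
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history; no smoothing in this sketch
        # Approximates p(cur | prev) as count(prev, cur) / count(prev).
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Hypothetical toy corpus of tokenized sentences.
corpus = [
    ["protein", "binds", "receptor"],
    ["protein", "binds", "ligand"],
    ["receptor", "binds", "ligand"],
]
unigrams, bigrams = train_counts(corpus)
sentence = ["protein", "binds", "ligand"]
p_uni = unigram_prob(sentence, unigrams)
p_bi = bigram_prob(sentence, unigrams, bigrams)
print(p_uni, p_bi)
```

On this toy corpus the bigram model assigns the sentence a higher probability than the unigram model, 
because it uses the observed word order. 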


Neural machine language models: in this type of model, the language system is defined using neural 
networks [5]. Advanced neural network models are used to enhance conventional language representations 
and to improve the accuracy of natural language processing applications such as machine translation, 
sentiment analysis, personal assistants, and recommendation systems [6]. Biomedical documents are 
represented in bibliographic databases by their most important keywords. Each document is compared with 
the others using the similarity of these bags of keywords, and a semantic corpus of related documents is 
created, with which the machine can perform better on natural language processing tasks [7]. Generating key 
terms from medical documents and building a hierarchical model from the relationships among these terms 
yields a language model known as a probabilistic language model. Such probabilistic models generate the 
next sequence in sentence formation [8]. 


2. RELATED WORK 

Learned word representations capture the semantic and syntactic regularities of the terms in 
documents. Every term is represented as a vector, regularities such as the singular and plural forms of a 
word are captured, and the similarity between these vectors is calculated with the cosine similarity measure. 
Based on these semantic relations, a word network model is defined [9]. The cosine similarity of two terms is 
defined as in (4) [9]. 


\cos(x_1, x_2) = \frac{x_1 \cdot x_2}{\|x_1\| \, \|x_2\|}    (4) 


In (4), x_1 and x_2 are two word vectors and \|x\| denotes the Euclidean norm of a vector. Published 
medical research articles contain more than 7 billion words. With the advances in language processing 
applications, this large unlabeled corpus can be used to generate statistical word network models for 
applications such as text classification, named entity recognition, and query processing [10]. 
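
The cosine similarity in (4) can be sketched in a few lines. The vectors below are made-up values chosen for 
illustration, not embeddings from the paper, and returning 0.0 for a zero vector is a convention assumed 
here. 

```python
import math

def cosine_similarity(x1, x2):
    """Dot product of two vectors divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # assumed convention: similarity with a zero vector is 0
    return dot / (norm1 * norm2)

# Hypothetical word vectors.
v_gene = [1.0, 2.0, 0.0]
v_protein = [2.0, 4.0, 0.0]   # same direction as v_gene
v_other = [0.0, 0.0, 3.0]     # orthogonal to v_gene

sim_parallel = cosine_similarity(v_gene, v_protein)
sim_orthogonal = cosine_similarity(v_gene, v_other)
print(sim_parallel, sim_orthogonal)
```

Vectors that point in the same direction score close to 1 regardless of their length, and orthogonal 
vectors score 0. 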

The National Center for Biotechnology Information (NCBI) hosts PubMed, the largest medical literature 
data source, with a database of more than 30 million citations and 16 million medical article abstracts. These 
huge unannotated texts are transformed into a structured form for experimenting with many biomedical natural 
language processing (NLP) tasks [11]. Statistics for these large, unannotated, openly accessible article sources 
are given in Table 1 [12], [13]. 


Table 1. PubMed and Medline repository statistics in terms of articles, sentences, and words 
Item         PubMed           Medline          Total 
Articles     28,150,357       896,589          29,046,946 
Sentences    134,665,774      123,178,654      257,844,428 
Words        4,866,356,547    3,609,534,655    8,475,891,202 


From the large data sets shown in Table 1, continuous word vector representations are computed and 
their similarity is compared with earlier well-performing neural network models; the word vector 
representation was shown to outperform them on 1.6 billion words. Learning high-quality vectors from a large 
corpus is the goal of the cited work [14]. A feedforward recurrent neural network language model is defined 
on the above large corpus, and it outperforms semantic architectures such as latent Dirichlet allocation and 
latent semantic analysis. The neural network model consists of four layers: the input layer, the projection 
layer, the hidden layer, and the output layer [15]. For learning the semantic and syntactic regularities of word 
representations, the frequency of term occurrence in the corpus is the main source of information. 
Co-occurrence probabilities are extracted from global corpus statistics; the resulting model is called GloVe 
(global vectors) because it incorporates these global statistics [16]. 
The co-occurrence probability is defined from the global statistics as P(j|i), which is derived from the 
number of times the word j occurs in the context of the word i [17]. Over the years, learned vector space 
word representations such as Word2Vec, fastText, and GloVe have gained prominence for providing 
better-quality feature representations that capture biomedical domain knowledge and support natural language 
processing tasks [18]. Most of these techniques fail to capture the deep contextual relationships among the 
terms in documents, so a novel context-based word embedding model was defined on PubMed, one of the 
largest biomedical data sets, whose article statistics are given in Table 1. Applications in biomedicine such 
as drug discovery, protein sequence analysis, concept search, and question answering require a deep 
understanding of biomedical concepts. The main challenge in word embeddings is the representation of words 
through a word learning model. Composite embeddings and context-aware embeddings were defined and 
compared with BioWordVec, and the model outperformed it [19]. 
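
As an illustration of the co-occurrence probability P(j|i), the sketch below counts context words inside a 
symmetric window and normalizes the counts for each word i. The window size, the toy corpus, and the 
function names are assumptions for this example only; GloVe itself additionally weights counts by distance, 
which is omitted here. 

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=1):
    """For every word i, count the words j that appear within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for idx, word in enumerate(tokens):
            lo = max(0, idx - window)
            hi = min(len(tokens), idx + window + 1)
            for ctx in range(lo, hi):
                if ctx != idx:
                    counts[word][tokens[ctx]] += 1
    return counts

def cooccurrence_prob(counts, i, j):
    """P(j|i): count of j in the context of i, divided by all context counts of i."""
    total = sum(counts[i].values())
    return counts[i][j] / total if total else 0.0

# Hypothetical tokenized sentences.
corpus = [
    ["kinase", "binds", "protein"],
    ["kinase", "inhibits", "protein"],
]
counts = cooccurrence_counts(corpus, window=1)
p_binds_given_kinase = cooccurrence_prob(counts, "kinase", "binds")
p_protein_given_kinase = cooccurrence_prob(counts, "kinase", "protein")
print(p_binds_given_kinase, p_protein_given_kinase)
```

With a window of one word, "protein" never falls in the context of "kinase" in this corpus, so its 
conditional probability is zero; widening the window would change that. 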

Sentence similarity is an important task in medical text mining, and traditional methods such as bag of 
words may not define a topic model accurately because of word ambiguity. BioSentVec is therefore defined 
with sentence embeddings learned from recurrent pretrained sentence vectors [20]. Identifying and mining 
terms from a large corpus is laborious. To address this, the TermInformer project was proposed, in which the 
system automatically mines terms together with their meanings from biomedical documents and also reuses 
existing word embeddings. This work compared the automatically mined term embeddings with pretrained word 
embeddings, and the comparison showed that automatically mined terms define semantic relationships more 
accurately [21]. 

Although biomedical word embeddings are mostly generated from unstructured data, several methods 
for generating embeddings from structured data have recently been proposed. Term embedding approaches on 
structured data lack a systematic method for assessing the quality of the resulting word similarity embeddings 
and their performance in NLP tasks. A framework comprising three different tasks associated with aspects of 
ontological concepts is proposed: i) the categorization, ii) the hierarchical, and iii) the relational features. 
For each aspect, metrics for evaluating the quality of the task are also suggested [22]. 

In biomedical natural language processing (BioNLP) applications and research, statistical language 
modeling remains a primary and essential step, even with the continuous improvement in contextual word 
embeddings from neural methods. Several neural architectures have been proposed to meet specific BioNLP 
task objectives and downstream applications. BioNLP tasks continue to improve as they adopt neural network 
architectures and the latest advances, including pretrained transformers, long short-term memory (LSTM) 
networks, recurrent neural networks (RNN), and bidirectional encoder and decoder techniques [23]. 

The existing word embedding methods Word2Vec and GloVe are limited in their word vector 
representation: each word is represented as a single vector and the internal substructure of the word is 
ignored. Moreover, existing methods concentrate on generating embeddings from one large corpus and ignore 
domain knowledge. A topic model that captures subword information and generates insights about the words 
would greatly benefit BioNLP applications. For each word, the subword or heading information is captured 
using the concept of a metathesaurus, and the generated subheadings are exploited as medical subject headings 
(MeSH). With this domain-specific knowledge of medical headings, the quality of medical word embeddings 
can improve in terms of BioNLP application performance [24]. 

PMCVec is an unsupervised technique for generating phrase embeddings from PubMed abstracts. 
Phrase embeddings explore the N-gram structures of word embeddings, whereas existing models represent 
only unigrams. PMCVec proposes a multi-step model: i) biomedical article pre-processing, ii) generating 
phrases from the medical articles using a chunking approach, iii) filtering, ranking, and re-ranking the 
generated phrases, and iv) creating a distributed representation of the phrases by phrase tagging [25]. 

The Unified Medical Language System (UMLS) is defined from medical subheadings concept modeling. 
The phrases in this metathesaurus model are combined to generate paraphrases of the UMLS concepts. The 
paraphrases are created in three phases of this concept modeling: i) feature collection, ii) feature selection, 
and iii) UMLS concept modeling [12]. 

Word embeddings are trained from four diverse repositories (biomedical publications, electronic 
health records, news articles, and Wikipedia) and evaluated with quantitative and qualitative approaches, 
with medical subheadings drawn from three categories: disorders, symptoms, and drugs. Intrinsic and extrinsic 
quantitative evaluation methods are used to evaluate the quality of the embeddings generated for the three 
categories. In intrinsic evaluation, the word embeddings are checked with respect to medical semantics, i.e., 
by measuring the semantic similarity between medical terms [13]. Sentence embeddings and other word 
embedding techniques in the biomedical domain require a unified methodology for assessment, so a medical 
sentence evaluation system, MedSentEval, is proposed for BioNLP problems, covering ten different 
classification tasks [14]. 
3. DOCUMENT REPRESENTATION METHODS 

Word representation depends on the distributional hypothesis: terms that occur in the same 
distributions share similar meanings. Many language representation models have been proposed to represent 
a medical document as a medical term model [15]. 

The word vector representation consists of the chain-rule probability distribution of words and their 
co-occurrence. Many approaches have been proposed on different corpora, including the biomedical domain, 
and many of these models have been implemented in applications such as: i) keyword concept search, 
ii) question answering systems, and iii) medical recommendations [15]. 

The models proposed for the biomedical domain in early studies are bag-of-words models, which 
represent the document words and their co-occurrence as vectors. A bag of words (BoW) model for medical 
articles uses a high-dimensional matrix, and many methods have been proposed for normalizing this 
representation, such as latent semantic analysis (LSA), along with methods such as principal component 
analysis (PCA) and random indexing (RI) for reducing the high-dimensional sparse matrix to a 
low-dimensional one [15]. 

Over the years, word vectors have been upgraded to embeddings using neural networks, with models 
such as the global vectors (GloVe) model [4], the fastText model [3], the Word2Vec model [3], embeddings 
from language models [15], and bidirectional encoder representations from transformers (BERT) [15]. 
Medical word embeddings learned from biomedical corpora as part of neural network language models have 
been used for specific tasks such as protein sequence identification, gene bank corpora analysis, and 
gene structure analysis. 

In the BoW neural model shown in Figure 1, the co-occurrence of similar words is processed in 
different layers of the neural network, and a language model is developed with word sequence generation as 
shown in Figure 2. The input documents are vectorized with the Word2Vec model, and the input layer is fed 
with N input Word2Vec vectors generated from the MED corpus. Sequence similarity is processed in the 
hidden layers, language is constructed from the generated word sequences in the dense layers, and the output 
layer is the generated sequence of terms, which can help in crucial medical decisions such as medicine 
recommendation and disease prediction. The input Word2Vec consists of vectors holding the frequency of 
each word in the medical document. Weights are assigned to these terms from the term frequency and the 
inverse document frequency, and the cosine similarity of the words is calculated from these weights. 


[Figure: MED Document-1, Document-2, ..., Document-N are each vectorized with Word2Vec; similar words 
are bagged to find the word sequence and co-occurrence.] 


Figure 1. Bag of words generation from medical documents 
Figure 2. Word sequence generation using recurrent neural network 


4. PROPOSED CONCEPT MODELING METHOD (CONCEPT2VEC) 

In this paper, the term similarity task is extended to concept similarity, and the complexity of the 
sparse matrix is reduced by generating concept embeddings from the medical corpus. Concepts can be 
generated from the N-grams constructed from the medical documents. The co-occurrence probability of these 
N-grams is calculated, and the values are vectorized as a concept model over 2-grams, 3-grams, and so on 
up to N-grams. 

The N-gram is constructed from the N words of a medical document as: 


Document_1 = {W_1, W_2, ..., W_n}    (vector of words) 
N-Gram_{Doc-1} = P(W_1 | W_2 ... W_n), P(W_2 | W_3 ... W_n), ..., P(W_n | W_1 ... W_{n-1}) 


A concept hierarchy is then generated; using this hierarchical model, the next sequence of a sentence is 
predicted with 98% accuracy, higher than other existing models. In generating this concept hierarchy, term 
frequency is the basic measure for generating the key terms. The importance of each generated key term is 
calculated with a weight function based on the inverse document frequency [16]. 


Term-Weight (tw) = TF / IDF 
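
A term weight of this kind can be sketched as follows. The sketch uses the standard TF-IDF form, in which 
term frequency is multiplied by the logarithmic inverse document frequency log(N / df); whether the formula 
above intends exactly this form is an assumption, as are the toy documents and the function name. 

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_counts = Counter(tokens)
        weights.append({
            # relative term frequency times log inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights

# Hypothetical tokenized abstracts.
docs = [
    ["protein", "interaction", "protein", "binding"],
    ["protein", "structure"],
    ["gene", "expression", "protein"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that occurs in every document, such as "protein" here, receives a weight of zero, while terms 
specific to one document are weighted up; this is what makes the weight useful for selecting key terms. 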


These weighted terms and their similarity generate a set of similar terms, which is called a 
concept [16]. 


Concept = (tw_1, tw_2, ..., tw_n) 
C_{11} = P(tw_{11}, tw_{12}, ..., tw_{1i}) 


N-Gram_{Doc-1} = Document_1 = 
(C_{11}, C_{12}, ..., C_{1m}), (C_{11}, C_{12}, ..., C_{1n}), (C_{11}, C_{12}, ..., C_{1o}), ..., (C_{11}, C_{12}, ..., C_{1z}) 


N-Gram_{Doc-2} = Document_2 = 
(C_{21}, C_{22}, ..., C_{2m}), (C_{21}, C_{22}, ..., C_{2n}), (C_{21}, C_{22}, ..., C_{2o}), ..., (C_{21}, C_{22}, ..., C_{2z}) 


N-Gram_{Doc-J} = Document_J = 
(C_{J1}, C_{J2}, ..., C_{Jm}), (C_{J1}, C_{J2}, ..., C_{Jn}), (C_{J1}, C_{J2}, ..., C_{Jo}), ..., (C_{J1}, C_{J2}, ..., C_{Jz}) 


Each N-gram generated from the medical documents can be considered a set of concepts. Using concept 
similarity measures, the correlation between the concepts is calculated to generate the bag of concepts (BoC), 
the set of vectors generated from the different concepts in PubMed documents. Using a recurrent neural 
model, a concept language model can be generated from the concept hierarchy. From a medical document, 
different sets of correlated concepts are generated with the probabilities of the N-grams from the same 
document, and using different dense layers of the neural network, a concept hierarchy model is generated 
that can predict the next sequence from the current sequence, as shown in Figure 3. 
[Figure: concept sets of Document 1 through Document J, from (C11, C12, ..., C1m) to (CJ1, CJ2, ..., CJz).] 


Figure 3. Example ConceptNet model 


5. EXPERIMENTATION RESULTS 

In this paper, we use medical documents from the PubMed abstracts of NCBI, selected by the concept 
search “protein-protein interaction”. PubMed Central provides an interface for collecting biomedical 
documents through the E-utilities of the National Library of Medicine, and we collected 20k medical abstracts 
with these utilities using the keyword “protein-protein interaction”. The top 20 thousand documents are 
considered for the research. In this model, the collected abstracts are processed in different layers of a 
recurrent neural model to generate the Concept2Vec vectors. The process followed in this model, shown in 
Figure 4, is: i) abstract pre-processing, ii) BoC generation, and iii) concept modeling (Concept2Vec). 


[Figure: Abstract processing (key terms generation) -> Bag of concepts (BoC) generation -> Recurrent neural 
layer model for BoC to generate Concept2Vec -> Next sequence sentence generation from current.] 


Figure 4. Proposed Concept2Vec generation model 


Extracted documents are pre-processed with stop word removal and stemming to generate the tokens 
of each document. The importance of each token is calculated with the weight measure, and all important 
tokens of the documents are identified and indexed in a matrix, whose size is reduced with dimensionality 
reduction methods. The important terms are combined into a set, and their probability of co-occurrence is 
calculated and indexed as a vector, called the concept vector. These concept vectors are processed in the 
layers of a neural network as a multi-layer perceptron (MLP) model, which helps in identifying the sentence 
sequence. 
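
The pre-processing and co-occurrence steps described above can be sketched as follows. The stop word 
list, the crude suffix stripper (a stand-in for a real stemmer such as Porter), and the choice to measure 
co-occurrence as the fraction of documents containing both terms are all assumptions made for illustration; 
the weighting and dimensionality reduction steps are omitted. 

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative subset of stop words, not a complete list.
STOP_WORDS = {"the", "of", "and", "in", "is", "a", "to", "with", "are"}

def crude_stem(token):
    """Strip a few common suffixes; a simplistic stand-in for a real stemmer."""
    for suffix in ("ing", "ions", "ion", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def pair_cooccurrence(token_lists):
    """Fraction of documents in which each pair of terms appears together."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(combinations(sorted(set(tokens)), 2))
    n_docs = len(token_lists)
    return {pair: count / n_docs for pair, count in pair_counts.items()}

# Hypothetical abstracts.
abstracts = [
    "The protein binds to the receptor.",
    "Binding of proteins and receptors is regulated.",
]
processed = [preprocess(a) for a in abstracts]
pairs = pair_cooccurrence(processed)
print(processed)
print(pairs[("bind", "protein")])
```

After stemming, "binds" and "Binding" collapse to the same token, so the pair ("bind", "protein") is 
found in both abstracts. 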

The MLP model with BoC is represented as: 


MLP(BoC) = L_n(SoftMax(L_{n-1}(ReLU(L_{n-2}(... (ReLU(Concept2Vec)) ...)))))    (5) 
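
A forward pass through such a stack of layers can be sketched as below. This follows the conventional 
reading of (5), with ReLU after each hidden layer and a softmax on the output; the exact placement of the 
activations, the layer sizes, the random weights, and the input vector are assumptions, since the paper does 
not specify them. 

```python
import math
import random

def relu(vector):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in vector]

def softmax(vector):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    peak = max(vector)
    exps = [math.exp(v - peak) for v in vector]
    total = sum(exps)
    return [e / total for e in exps]

def dense(vector, weights, biases):
    """Affine layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(concept_vector, layers):
    """Hidden layers use ReLU; the final layer is followed by softmax."""
    hidden = concept_vector
    for weights, biases in layers[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(hidden, weights, biases))

rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 3, rng)]
concept_vector = [0.2, 0.0, 0.7, 0.1]  # hypothetical Concept2Vec input
probs = mlp_forward(concept_vector, layers)
print(probs)
```

The output is a probability distribution over three hypothetical candidate sequences; in a trained model 
the weights would be learned rather than drawn at random. 
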
The next generated sequence is evaluated with accuracy measures such as precision, recall, and 
F-measure against different aspects of the embeddings. The results for different combinations of embeddings 
and similarity measures are listed in Table 2 [5]. 
Table 2. Different training sets and the similarity measure results on two embedding models 


Data set        Similarity measure    Word2Vec embeddings model    Concept2Vec embeddings model 
5k Tokenset     Cosine                0.66                         0.80 
10k Tokenset    Cosine                0.67                         0.85 
15k Tokenset    Cosine                0.67                         0.89 
20k Tokenset    Cosine                0.69                         0.98 


Different model combinations were tested, and the results are compared in Table 3. The evaluation 
results represent the different method combinations for generating the sentence sequence from the history of 
the concept hierarchy. The experimental results reveal that next word generation with the Word2Vec model is 
not accurate enough, and that the existing unsupervised models are less accurate and more complex than the 
Concept2Vec model. 


Table 3. Embeddings evaluation against different combination methods 


Method             Precision    Recall    F1 measure 
Word2Vec           0.648        0.628     0.637 
Word2Vec+RNN       0.725        0.757     0.766 
Concept2Vec        0.834        0.864     0.852 
Concept2Vec+RNN    0.986        0.972     0.984 
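
For reference, precision, recall, and the F1 measure can be computed as in the sketch below. How predicted 
and reference sequences were matched in the experiments is not described, so comparing two sets of terms is 
an assumption, and the example terms are made up. 

```python
def precision_recall_f1(predicted, expected):
    """Compare a set of predicted items against a set of reference items."""
    predicted, expected = set(predicted), set(expected)
    true_pos = len(predicted & expected)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical predicted and reference next terms.
p, r, f1 = precision_recall_f1(
    predicted=["kinase", "receptor", "ligand", "enzyme"],
    expected=["kinase", "receptor", "antibody"],
)
print(p, r, f1)
```

Because F1 is a harmonic mean, it always lies between the precision and the recall it is computed from. 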


6. CONCLUSION 

In this research, we addressed the problem of matrix sparsity and defined a Concept2Vec model. 
Word distributions and representations increase matrix sparsity, because the number of dimensions grows 
with the single-word distribution. We therefore defined concepts from sets of words and used these concepts 
to define a Concept2Vec model that reduces the number of dimensions. The Concept2Vec network is processed 
recurrently in different layers of the network to predict the next sequence of words, and this model 
outperformed every other method combination of the existing models. 